Color naming guided intrinsic image decomposition
Intrinsic image decomposition is a severely under-constrained problem. User
interactions can help to reduce the ambiguity of the decomposition
considerably. The traditional way of user interaction is to draw scribbles that
indicate regions of constant reflectance or shading. However, the effective scope of each scribble is quite limited, so dozens of scribbles are often needed to rectify the whole decomposition, which is time-consuming. In this paper we propose a more efficient form of user interaction in which users need only annotate the color composition of the image. Color composition reveals the
global distribution of reflectance, so it can help to adapt the whole
decomposition directly. We build a generative model of the process by which the albedo of the material produces both the reflectance through imaging and the color labels through color naming. Our model effectively fuses the physical properties of image formation with the top-down information from human color perception. Experimental results show that color naming can improve the performance of intrinsic image decomposition, especially in cleaning up the shadows left in the reflectance and solving the color constancy problem.
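As a rough illustration of the idea (not the paper's actual generative model), per-pixel color-name posteriors can act as a global reflectance prior in a log-domain decomposition; the names decompose_with_color_names, palette, and lam below are hypothetical:

```python
# Hypothetical sketch: color-name annotations as a global reflectance prior.
import numpy as np

def decompose_with_color_names(image, name_probs, palette, lam=1.0):
    """image: HxWx3 linear RGB; name_probs: HxWxK posterior over K color
    names; palette: Kx3 representative albedo per color name."""
    eps = 1e-6
    log_i = np.log(image + eps)            # I = R * S  =>  log I = log R + log S
    # Prior reflectance: expected albedo under the color-name posterior.
    prior_r = name_probs @ palette         # HxWx3
    log_r_prior = np.log(prior_r + eps)
    # Per-pixel blend of observation and prior (a stand-in for the paper's
    # full generative inference over imaging and color naming).
    log_r = (log_i + lam * log_r_prior) / (1.0 + lam)
    log_s = log_i - log_r                  # shading is the residual
    return np.exp(log_r), np.exp(log_s)
```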
Long-term Multi-granularity Deep Framework for Driver Drowsiness Detection
For real-world driver drowsiness detection from videos, the variation in head pose, such as looking aside and lowering the head, is so large that existing methods operating on the global face cannot extract effective features. Temporal dependencies of variable length, e.g., yawning and speaking, are also rarely considered by previous approaches. In this paper, we propose a
Long-term Multi-granularity Deep Framework to detect driver drowsiness in
driving videos containing frontal faces. The framework includes two key
components: (1) a Multi-granularity Convolutional Neural Network (MCNN), a novel network that applies a group of parallel CNN extractors to well-aligned facial patches of different granularities and effectively extracts facial representations under large variations in head pose; furthermore, it can flexibly fuse both detailed appearance cues of the main facial parts and local-to-global spatial constraints; (2) a deep Long Short-Term Memory (LSTM) network applied to the facial representations to explore long-term relationships of variable length over sequential frames, which can distinguish states with temporal dependencies, such as blinking and closing the eyes. Our approach achieves 90.05% accuracy at about 37 fps on the evaluation set of the public NTHU-DDD dataset, establishing the state of the art for driver drowsiness detection. Moreover, we build a new dataset named FI-DDD, which localizes drowsy events in the temporal dimension with higher precision.
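A minimal PyTorch sketch of the two components as described, parallel patch extractors fused into one representation followed by an LSTM over frames; the class names, patch count, and layer sizes are illustrative assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

class PatchCNN(nn.Module):
    """Small CNN applied to one facial patch granularity."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, out_dim))
    def forward(self, x):
        return self.net(x)

class MCNNLSTM(nn.Module):
    def __init__(self, num_patches=4, feat_dim=128, num_states=4):
        super().__init__()
        # One extractor per granularity (e.g., eyes, mouth, whole face).
        self.extractors = nn.ModuleList(PatchCNN(feat_dim) for _ in range(num_patches))
        self.lstm = nn.LSTM(num_patches * feat_dim, 256, batch_first=True)
        self.head = nn.Linear(256, num_states)
    def forward(self, patches):
        # patches: (batch, time, num_patches, 3, H, W)
        b, t, p = patches.shape[:3]
        feats = [self.extractors[i](patches[:, :, i].flatten(0, 1)) for i in range(p)]
        fused = torch.cat(feats, dim=-1).view(b, t, -1)  # fuse granularities
        out, _ = self.lstm(fused)                        # temporal modeling
        return self.head(out)                            # per-frame state logits
```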
Learning Fixation Point Strategy for Object Detection and Classification
We propose a novel recurrent attentional structure to localize and recognize
objects jointly. The network can learn to extract a sequence of local
observations with detailed appearance and rough context, instead of sliding
windows or convolutions on the entire image. Meanwhile, those observations are
fused to complete the detection and classification tasks. For training, we present a hybrid loss function to learn the parameters of the multi-task network end-to-end. In particular, the combination of a stochastic and an object-awareness strategy, named SA, can select richer context and ensure that the last fixation lands close to the object. In addition, we build a real-world dataset to verify the capacity of our method to detect objects of interest, including small ones. Our method can predict a precise bounding box on an image,
and achieve high speed on large images without pooling operations. Experimental
results indicate that the proposed method can mine effective context by several
local observations. Moreover, the precision and speed are easily improved by
changing the number of recurrent steps. Finally, we will release the source code of our proposed approach.
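A minimal sketch of such a recurrent fixation loop under stated assumptions (the glimpse encoder, GRU cell, and head names are invented for illustration; the rough-context branch and the SA training strategy are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixationNet(nn.Module):
    def __init__(self, glimpse=32, hidden=256, num_classes=10, steps=6):
        super().__init__()
        self.steps, self.glimpse = steps, glimpse
        self.encode = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * glimpse * glimpse, hidden), nn.ReLU())
        self.rnn = nn.GRUCell(hidden, hidden)
        self.loc_head = nn.Linear(hidden, 2)    # next fixation (x, y) in [-1, 1]
        self.cls_head = nn.Linear(hidden, num_classes)
        self.box_head = nn.Linear(hidden, 4)    # bounding-box regression

    def crop(self, img, loc):
        # Differentiable local glimpse around loc via grid_sample.
        b = img.size(0)
        lin = torch.linspace(-0.2, 0.2, self.glimpse, device=img.device)
        gy, gx = torch.meshgrid(lin, lin, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).expand(b, -1, -1, -1) + loc.view(b, 1, 1, 2)
        return F.grid_sample(img, grid, align_corners=False)

    def forward(self, img):
        b = img.size(0)
        h = img.new_zeros(b, self.rnn.hidden_size)
        loc = img.new_zeros(b, 2)               # start at the image center
        for _ in range(self.steps):             # more steps -> higher precision
            h = self.rnn(self.encode(self.crop(img, loc)), h)
            loc = torch.tanh(self.loc_head(h))  # predict the next fixation
        return self.cls_head(h), self.box_head(h), loc
```

Note how the sequence of local observations replaces sliding windows or whole-image convolutions: only a few small crops are ever encoded, which is why speed scales well with image size.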
Topic-Guided Attention for Image Captioning
Attention mechanisms have attracted considerable interest in image captioning because of their strong performance. Existing attention-based models use feedback information from the caption generator as guidance to determine which of the image features should be attended to. A common defect of these attention generation methods is that they lack higher-level guiding information from the image itself, which limits their ability to select the most informative image features. Therefore, in this paper, we propose a novel attention mechanism, called topic-guided attention, which integrates image topics into the attention model as guiding information to help select the most important image
features. Moreover, we extract image features and image topics with separate
networks, which can be fine-tuned jointly in an end-to-end manner during
training. Experimental results on the benchmark Microsoft COCO dataset show that our method yields state-of-the-art performance on various quantitative metrics.
Comment: Accepted by ICIP 2018.
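A minimal sketch of attention guided jointly by the decoder state and a global topic vector, using the additive-attention form common in captioning; all module names below are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class TopicGuidedAttention(nn.Module):
    def __init__(self, feat_dim, topic_dim, hidden_dim, attn_dim=512):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_topic = nn.Linear(topic_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, topic, hidden):
        # feats: (batch, regions, feat_dim); topic: (batch, topic_dim);
        # hidden: (batch, hidden_dim) caption-generator state.
        e = torch.tanh(self.w_feat(feats)
                       + self.w_topic(topic).unsqueeze(1)     # image-level guidance
                       + self.w_hidden(hidden).unsqueeze(1))  # caption-side feedback
        alpha = torch.softmax(self.score(e).squeeze(-1), dim=1)
        return (alpha.unsqueeze(-1) * feats).sum(dim=1), alpha
```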
Consistency-aware Shading Orders Selective Fusion for Intrinsic Image Decomposition
We address the problem of decomposing a single image into reflectance and
shading. The difficulty comes from the fact that the components of the image (the surface albedo, the direct illumination, and the ambient illumination) are heavily coupled in the observed image. We propose to infer the shading by ordering pixels by their relative brightness, without knowing the absolute values of the image components beforehand. The pairwise shading orders are estimated in two ways: brightness order and low-order fittings of the local shading field. The
brightness order is a non-local measure, which can be applied to any pair of
pixels including those whose reflectance and shading are both different. The
low-order fittings are used for pixel pairs within local regions of smooth
shading. Together, they can capture both global order structure and local
variations of the shading. We propose a Consistency-aware Selective Fusion
(CSF) to integrate the pairwise orders into a globally consistent order. The
iterative selection process solves the conflicts between the pairwise orders
obtained by different estimation methods. Inconsistent or unreliable pairwise
orders will be automatically excluded from the fusion to avoid polluting the
global order. Experiments on the MIT Intrinsic Image dataset show that the
proposed model is effective at recovering the shading including deep shadows.
Our model also works well on natural images from the IIW dataset, the UIUC Shadow dataset, and the NYU-Depth dataset, where the colors of the direct and ambient lights are quite different.
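As a toy reduction of the fusion step (not the paper's CSF algorithm), one can greedily accept the most confident pairwise orders that keep the order graph acyclic, automatically rejecting conflicting ones, and then read off a globally consistent order; networkx is used here for brevity:

```python
import networkx as nx

def fuse_pairwise_orders(num_pixels, pairwise_orders):
    """pairwise_orders: list of (i, j, confidence), meaning shading(i) < shading(j)."""
    g = nx.DiGraph()
    g.add_nodes_from(range(num_pixels))
    for i, j, conf in sorted(pairwise_orders, key=lambda t: -t[2]):
        g.add_edge(i, j)
        if not nx.is_directed_acyclic_graph(g):  # conflict: exclude this order
            g.remove_edge(i, j)
    return list(nx.topological_sort(g))          # globally consistent order

# Example: orders from two estimators; the weakest one conflicts and is dropped.
print(fuse_pairwise_orders(3, [(0, 1, 0.9), (1, 2, 0.8), (2, 0, 0.3)]))
```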
SymmNet: A Symmetric Convolutional Neural Network for Occlusion Detection
Detecting the occlusion from stereo images or video frames is important to
many computer vision applications. Previous efforts focus on bundling it with
the computation of disparity or optical flow, leading to a chicken-and-egg
problem. In this paper, we leverage a convolutional neural network to liberate the occlusion detection task from the interleaved, traditional computation framework. We propose a Symmetric Network (SymmNet) to directly exploit
information from an image pair, without estimating disparity or motion in
advance. The proposed network is structurally left-right symmetric so as to learn the binocular occlusions simultaneously, with the aim of jointly improving both results. Comprehensive experiments show that our model achieves state-of-the-art results on detecting both stereo and motion occlusions.
Comment: BMVC 2018 camera-ready.
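A minimal sketch of the left-right symmetric idea, assuming a shared body applied to the (left, right) pair and its swapped counterpart; the architecture below is a placeholder, not the published SymmNet:

```python
import torch
import torch.nn as nn

class SymmOcclusionNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared weights process a concatenated image pair.
        self.body = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, left, right):
        # Swapping the inputs yields the other view's occlusion map,
        # so both are learned jointly by the same network.
        occ_left = torch.sigmoid(self.body(torch.cat([left, right], dim=1)))
        occ_right = torch.sigmoid(self.body(torch.cat([right, left], dim=1)))
        return occ_left, occ_right
```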
Multi-Kernel Correntropy for Robust Learning
As a novel similarity measure that is defined as the expectation of a kernel
function between two random variables, correntropy has been successfully
applied in robust machine learning and signal processing to combat large
outliers. The kernel function in correntropy is usually a zero-mean Gaussian
kernel. In a recent work, the concept of mixture correntropy (MC) was proposed
to improve the learning performance, where the kernel function is a mixture
Gaussian kernel, namely a linear combination of several zero-mean Gaussian
kernels with different widths. In both correntropy and mixture correntropy, the
center of the kernel function is, however, always located at zero. In the
present work, to further improve the learning performance, we propose the
concept of multi-kernel correntropy (MKC), in which each component of the
mixture Gaussian kernel can be centered at a different location. The properties
of the MKC are investigated and an efficient approach is proposed to determine
the free parameters in MKC. Experimental results show that the learning
algorithms under the maximum multi-kernel correntropy criterion (MMKCC) can
outperform those under the original maximum correntropy criterion (MCC) and the
maximum mixture correntropy criterion (MMCC).
Comment: 10 pages, 5 figures.
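In symbols, with error e = X - Y, the three measures described above can be written as follows (a plausible formalization from the abstract; the exact constraints on the weights are the paper's):

```latex
% Correntropy: expectation of a zero-mean Gaussian kernel of the error.
V(X,Y) = \mathbb{E}\!\left[\kappa_{\sigma}(e)\right],
\qquad \kappa_{\sigma}(e) = \exp\!\left(-\tfrac{e^{2}}{2\sigma^{2}}\right),
\qquad e = X - Y.

% Mixture correntropy (MC): a convex combination of zero-centered kernels.
V_{\mathrm{MC}}(X,Y) = \mathbb{E}\!\left[\sum_{i=1}^{m} \alpha_{i}\,\kappa_{\sigma_{i}}(e)\right],
\qquad \alpha_{i} \ge 0,\ \ \sum_{i=1}^{m}\alpha_{i} = 1.

% Multi-kernel correntropy (MKC): each component has its own center c_i.
V_{\mathrm{MKC}}(X,Y) = \mathbb{E}\!\left[\sum_{i=1}^{m} \alpha_{i}\,\kappa_{\sigma_{i}}(e - c_{i})\right].
```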
Salient Object Detection: A Discriminative Regional Feature Integration Approach
Salient object detection has been attracting a lot of interest, and recently
various heuristic computational models have been designed. In this paper, we
formulate saliency map computation as a regression problem. Our method, which
is based on multi-level image segmentation, utilizes the supervised learning
approach to map the regional feature vector to a saliency score. Saliency
scores across multiple levels are finally fused to produce the saliency map.
The contributions are two-fold. One is that we propose a discriminative regional feature integration approach for salient object detection. Compared
with existing heuristic models, our proposed method is able to automatically
integrate high-dimensional regional saliency features and choose discriminative
ones. The other is that by investigating standard generic region properties as
well as two widely studied concepts for salient object detection, i.e.,
regional contrast and backgroundness, our approach significantly outperforms
state-of-the-art methods on six benchmark datasets. Meanwhile, we demonstrate
that our method runs as fast as most existing algorithms.
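A minimal sketch of the regression formulation, using a random forest as the supervised learner; the feature dimensionality and function names are illustrative, and the multi-level fusion is only indicated in a comment:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Training: regional descriptors (regional contrast, backgroundness,
# generic region properties) -> ground-truth regional saliency in [0, 1].
rng = np.random.default_rng(0)
X_train = rng.random((5000, 93))   # e.g., 93-D regional feature vectors
y_train = rng.random(5000)         # regional saliency scores
model = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)

def saliency_map(region_features, region_masks, shape):
    """Score each region, then paint scores back into the image; maps from
    several segmentation levels would then be fused (e.g., averaged)."""
    scores = model.predict(region_features)
    out = np.zeros(shape)
    for mask, s in zip(region_masks, scores):
        out[mask] = s
    return out
```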
Improving Deep Visual Representation for Person Re-identification by Global and Local Image-language Association
Person re-identification is an important task that requires learning
discriminative visual features for distinguishing different person identities.
Diverse auxiliary information has been utilized to improve the visual feature
learning. In this paper, we propose to exploit natural language descriptions as additional training supervision for learning effective visual features. Compared with
other auxiliary information, language can describe a specific person in more compact and semantic visual aspects, and is thus complementary to the pixel-level image data. Our method not only learns a better global visual feature with the
supervision of the overall description but also enforces semantic consistencies
between local visual and linguistic features, which is achieved by building
global and local image-language associations. The global image-language
association is established according to the identity labels, while the local
association is based upon the implicit correspondences between image regions
and noun phrases. Extensive experiments demonstrate the effectiveness of
employing language as training supervision with the two association schemes.
Our method achieves state-of-the-art performance without utilizing any
auxiliary information during testing and shows better performance than other
joint embedding methods for the image-language association.
Comment: ECCV 2018.
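A minimal sketch, under stated assumptions, of the two association schemes: an identity-driven global image-text loss and a local loss that softly matches noun phrases to image regions; both functions and their exact forms are hypothetical:

```python
import torch
import torch.nn.functional as F

def global_association_loss(img_emb, txt_emb, identities):
    # Pull together image/description embeddings that share an identity label.
    sim = img_emb @ txt_emb.t()  # (B, B) similarity scores
    target = (identities.unsqueeze(0) == identities.unsqueeze(1)).float()
    return F.binary_cross_entropy_with_logits(sim, target)

def local_association_loss(region_feats, phrase_feats):
    # region_feats: (B, R, D); phrase_feats: (B, P, D). Each noun phrase is
    # softly matched to its best image regions (implicit correspondence).
    attn = torch.softmax(phrase_feats @ region_feats.transpose(1, 2), dim=-1)
    matched = attn @ region_feats  # (B, P, D) attended region features
    return (1 - F.cosine_similarity(matched, phrase_feats, dim=-1)).mean()
```

Because both losses act only on training-time supervision, the visual branch needs no language input at test time, which matches the claim above.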
End-to-end Lane Shape Prediction with Transformers
Lane detection, the process of identifying lane markings as approximated
curves, is widely used for lane departure warning and adaptive cruise control
in autonomous vehicles. The popular pipeline that solves the task in two steps, feature extraction plus post-processing, while useful, is inefficient and poor at learning the global context and the long, thin structure of lanes. To
tackle these issues, we propose an end-to-end method that directly outputs
parameters of a lane shape model, using a network built with a transformer to
learn richer structures and context. The lane shape model is formulated based
on road structures and camera pose, providing physical interpretation for
parameters of network output. The transformer models non-local interactions
with a self-attention mechanism to capture slender structures and global
context. The proposed method is validated on the TuSimple benchmark and shows
state-of-the-art accuracy with the most lightweight model size and fastest
speed. Additionally, our method shows excellent adaptability to a challenging
self-collected lane detection dataset, demonstrating its strong deployment potential in real applications. Code is available at https://github.com/liuruijin17/LSTR.
Comment: 9 pages, 7 figures, accepted by WACV 2021.
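As a rough illustration (a simplified stand-in, not the paper's exact parametrization), the network could regress a handful of curve coefficients per lane, where a hyperbolic-plus-linear form in image rows arises when a polynomial ground-plane lane is projected through a pitched camera:

```python
import numpy as np

def lane_curve(params, v):
    """Evaluate a lane's horizontal position u at normalized image rows v.
    params = (a, b, c, d, v0): hypothetical coefficients; v0 plays the role
    of the horizon row, tying the parameters to road structure and camera
    pose so the network's outputs stay physically interpretable."""
    a, b, c, d, v0 = params
    return a / (v - v0) ** 2 + b / (v - v0) + c + d * v

# Usage: draw a predicted lane over rows below the assumed horizon v0 = 0.5.
v = np.linspace(0.55, 1.0, 50)
u = lane_curve((0.002, 0.05, 0.3, 0.1, 0.5), v)
```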